Causal language modeling
There are two types of language modeling: causal and masked.
Causal language modeling predicts the next token in a sequence of tokens; the model can only attend to tokens on the left and cannot see future tokens.
Because of this left-to-right structure, causal language models such as GPT-2 are frequently used for text generation.
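As a quick illustration, GPT-2 can be loaded through the Transformers text-generation pipeline (a minimal sketch; the prompt and generation length are arbitrary examples):
code:generation_example.py
from transformers import pipeline

# Load GPT-2, a causal language model, for text generation
generator = pipeline("text-generation", model="gpt2")

# The model extends the prompt one token at a time, left to right
output = generator("Causal language models can", max_new_tokens=20)
print(output[0]["generated_text"])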
Now, as a preprocessing step, create a batch of examples using DataCollatorForLanguageModeling.
Use the end-of-sequence token as the padding token and set mlm=False, so the inputs also serve as the labels, shifted to the right by one element (the model performs this shift internally when computing the loss):
code:pytorch_example.py
from transformers import DataCollatorForLanguageModeling

# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
# mlm=False selects causal language modeling rather than masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
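To see what the collator actually produces, here is a self-contained sketch (the example sentences are arbitrary):
code:collator_example.py
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Tokenize two sentences of different lengths
examples = [tokenizer("Hello world"), tokenizer("A somewhat longer example sentence")]
batch = data_collator(examples)

print(batch["input_ids"].shape)  # both examples padded to the longest length in the batch
print(batch["labels"])           # copies of input_ids, with padded positions set to -100
Note that the labels come out as an unshifted copy of the inputs; the model applies the one-position shift when computing the loss, and positions set to -100 are ignored by the loss function, so the model is never trained to predict padding.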